Code

Imports

Read the whole dataset and reduce it to what we are interested in
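A minimal sketch of this step, assuming the data arrives as a CSV file and that the columns of interest are the four named elsewhere in this notebook. The in-memory file, the extra column `other_column`, and the sample row are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for the real file; the content is invented for illustration.
raw = io.StringIO(
    "dobdb_family_id,earliest_publn_date,appln_title,appln_abstract,other_column\n"
    "1,2001-05-04,Solar cell,A photovoltaic cell.,x\n"
    "2,2003-09-30,Wind turbine,A rotor blade.,y\n"
)

# Read everything first, then keep only the columns we care about.
df = pd.read_csv(raw, parse_dates=["earliest_publn_date"])
columns_of_interest = [
    "dobdb_family_id",
    "earliest_publn_date",
    "appln_title",
    "appln_abstract",
]
df = df[columns_of_interest]
```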

Sort it by ['dobdb_family_id', 'earliest_publn_date']
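The sort could look roughly like this with pandas. The toy rows are made up; only the two column names come from the step above.

```python
import pandas as pd

# Toy stand-in for the patent table; the rows are invented for illustration.
df = pd.DataFrame({
    "dobdb_family_id": [2, 1, 2, 1],
    "earliest_publn_date": pd.to_datetime(
        ["2005-03-01", "2010-07-15", "2001-01-20", "2008-11-30"]
    ),
    "appln_title": ["c", "b", "d", "a"],
})

# Sort by family first, then chronologically within each family.
df = df.sort_values(["dobdb_family_id", "earliest_publn_date"]).reset_index(drop=True)
```

Sorting chronologically within each family is what later makes "the last row of a family" mean "the most recent publication".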

Reduce to the years we are interested in
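One way to sketch the year filter, assuming `earliest_publn_date` is already a datetime column. The bounds `first_year` and `last_year` are hypothetical values chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "earliest_publn_date": pd.to_datetime(
        ["1979-12-31", "1980-01-01", "1999-06-30", "2021-01-01"]
    ),
})

# Hypothetical bounds; both ends are included.
first_year, last_year = 1980, 2020

years = df["earliest_publn_date"].dt.year
df = df[years.between(first_year, last_year)]
```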

Replace NaNs in appln_abstract and appln_title with ' '
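A small sketch of the replacement with `fillna`; the two rows are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "appln_title": ["Solar cell", np.nan],
    "appln_abstract": [np.nan, "A wind turbine blade."],
})

# Replace missing text with a single space so later string operations do not fail.
text_cols = ["appln_abstract", "appln_title"]
df[text_cols] = df[text_cols].fillna(" ")
```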

Infer our time frame from the data
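The time frame can be read off the publication dates, for example as below. The names `first_year`, `last_year`, and `time_frame` are assumptions made for this sketch.

```python
import pandas as pd

dates = pd.Series(
    pd.to_datetime(["1994-02-11", "2003-09-30", "1987-06-05"]),
    name="earliest_publn_date",
)

# Earliest and latest publication year present in the data.
years = dates.dt.year
first_year, last_year = int(years.min()), int(years.max())

# Every year in the range, including years without any publication.
time_frame = list(range(first_year, last_year + 1))
```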

Of every family, keep only the last English, non-NaN title and abstract
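A sketch for the titles, assuming the table is sorted as above and that a language column named `appln_title_lg` with the code `en` for English exists. That column name and code are assumptions, not confirmed by this notebook. Abstracts could be handled the same way with their own language column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dobdb_family_id": [1, 1, 1, 2, 2],
    "earliest_publn_date": pd.to_datetime(
        ["2001-01-01", "2002-01-01", "2003-01-01", "2005-01-01", "2006-01-01"]
    ),
    "appln_title": ["Old title", "New title", "Neuer Titel", "Only title", np.nan],
    "appln_title_lg": ["en", "en", "de", "en", "en"],  # assumed column name
})
df = df.sort_values(["dobdb_family_id", "earliest_publn_date"])

# Usable rows: English and with real text (neither NaN nor blank).
is_english = df["appln_title_lg"].eq("en")
has_text = df["appln_title"].fillna(" ").str.strip().ne("")

# Within each family, keep only the most recent usable row.
titles = (
    df[is_english & has_text]
    .drop_duplicates("dobdb_family_id", keep="last")
    .set_index("dobdb_family_id")["appln_title"]
)
```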

For Bruno: Of every family, keep only the last English, non-NaN title and abstract, and also save the respective family ID and year

Get title and abstract counts for each year

Write the counts to a DataFrame and normalise them
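A sketch of counting and normalising, assuming a `year` column has already been derived and that "normalise" means dividing each column by its total so the shares sum to 1. Both are assumptions; a different normalisation may be intended.

```python
import pandas as pd

docs = pd.DataFrame({
    "year": [2001, 2001, 2002, 2003],
    "appln_title": ["a", "b", "c", "d"],
    "appln_abstract": ["x", " ", "y", "z"],  # " " marks a missing abstract
})

# Flag non-blank texts, then count them per year.
non_blank = docs[["appln_title", "appln_abstract"]].apply(
    lambda s: s.str.strip().ne("")
).astype(int)
counts = non_blank.groupby(docs["year"]).sum()
counts.columns = ["titles", "abstracts"]

# Assumed normalisation: each year's share of the column total.
normalised = counts / counts.sum()
```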

Define stopwords, contexts, equivalents, words to replace, and punctuation

Define a function for key phrase extraction and counting
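A minimal sketch of such a function, using plain n-gram counting. The name `count_ngrams`, the tiny stopword set, and the rule of dropping n-grams that start or end with a stopword are all assumptions made for illustration, not the notebook's actual definitions.

```python
import re
from collections import Counter

# Hypothetical stopword list; the real one is defined in the previous step.
STOPWORDS = {"a", "an", "and", "for", "of", "the"}


def count_ngrams(texts, n, stopwords=STOPWORDS):
    """Count n-grams over all texts, skipping those that start or end with a stopword."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep alphanumeric runs, which also strips punctuation.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in stopwords or gram[-1] in stopwords:
                continue
            counts[" ".join(gram)] += 1
    return counts


texts = ["Method for the storage of energy", "Energy storage device"]
unigrams = count_ngrams(texts, 1)
bigrams = count_ngrams(texts, 2)
trigrams = count_ngrams(texts, 3)
```

Allowing stopwords in the middle of an n-gram keeps phrases such as "storage of energy" while still discarding fragments like "for the".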

Define a function for generating LaTeX code
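A sketch of a LaTeX generator that turns phrase counts into a `tabular` environment. The function name `to_latex_table`, the two-column layout, and the `top` parameter are hypothetical; the notebook's real output format is not shown here.

```python
def to_latex_table(counts, top=3):
    """Render the `top` most frequent phrases as a LaTeX tabular."""
    # Most frequent first; ties are broken alphabetically for stable output.
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top]
    lines = [r"\begin{tabular}{lr}", r"\hline", r"Key phrase & Count \\", r"\hline"]
    for phrase, count in rows:
        # Escape the characters most likely to break a table cell.
        escaped = phrase.replace("&", r"\&").replace("%", r"\%").replace("_", r"\_")
        lines.append(rf"{escaped} & {count} \\")
    lines += [r"\hline", r"\end{tabular}"]
    return "\n".join(lines)


table = to_latex_table({"energy storage": 5, "fuel cell": 7, "r&d": 1}, top=2)
```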

Two more definitions

Results

Titles

Titles - unigrams

Titles - bigrams

Titles - trigrams

Abstracts

Abstracts - unigrams

Abstracts - bigrams

Abstracts - trigrams

Search for certain patterns
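A pattern search could be sketched with a regular expression and `Series.str.contains`. The sample titles and the pattern are made up for illustration.

```python
import pandas as pd

titles = pd.Series([
    "Lithium-ion battery pack",
    "Wind turbine blade",
    "Battery charging method",
    "Solid state batteries",
])

# Hypothetical pattern: "battery" or "batteries" as whole words, any case.
pattern = r"\bbatter(?:y|ies)\b"

mask = titles.str.contains(pattern, case=False, regex=True)
matches = titles[mask]
```

The non-capturing group `(?:...)` avoids the warning pandas raises when a pattern passed to `str.contains` has capturing groups.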